A lot of attention has been devoted to multimedia indexing over the past few years. In the literature, two kinds
of fusion schemes are usually considered: early fusion and late fusion. In this paper we focus on late
classifier fusion, where one combines the scores of each modality at the decision level. To tackle this problem, we
investigate a recent, elegant and well-founded quadratic program named MinCq, coming from the machine learning
PAC-Bayes theory. MinCq looks for the weighted combination, over a set of real-valued functions seen as voters,
leading to the lowest misclassification rate, while making use of the voters' diversity. We provide evidence that
this method is naturally adapted to the late fusion procedure. We propose an extension of MinCq by adding an
order-preserving pairwise loss for ranking, helping to improve the Mean Average Precision measure. We confirm the
good behavior of the MinCq-based fusion approaches with experiments on a real image benchmark.
1 Introduction

Combining multimodal information is an important issue in multimedia, and a lot of research effort has been
dedicated to this problem (see Atrey et al. (2010) for a survey). Indeed, the fusion of multimodal inputs can bring
complementary information, from various sources, useful for improving the quality of any multimedia analysis
method such as semantic concept detection, audio-visual event detection, object tracking, etc.

The different modalities correspond generally to a relevant set of features that can be grouped into different 
views. For example, classical visual or textual features commonly used in multimedia are based on TF-IDF, bag of 
words, texture, color, SIFT, spatio-temporal descriptors, etc. Once these features have been extracted, another step 
consists in using machine learning methods in order to build classifiers able to discriminate a given concept.



Two main schemes are generally considered Snoek et al. (2005). In the early fusion approach, all the available
data/features are merged into one feature vector before the learning and classification steps. This can be seen as
a unimodal classification. However, this kind of approach has to deal with heterogeneous data or features which
are sometimes difficult to combine. The late fusion model works at the decision level by combining the prediction
scores available for each modality. This is usually called multimodal classification or classifier fusion. Late fusion
may not always outperform unimodal classification, especially when one modality provides significantly better
results than the others or when one has to deal with imbalanced input features. However, the late fusion scheme
tends to give better results for learning semantic concepts in the case of multimodal video Snoek et al. (2005).
Several methods based on a fixed decision rule have been proposed for combining classifiers, such as max, min,
product, sum, etc. Kittler et al. (1998). Other approaches, often referred to as stacking Wolpert (1992), need an
extra learning step.

* This work was supported in part by the French project VideoSense ANR-09-CORD-026 of the ANR, and in part by the IST Programme of the
European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors' views.

In this paper, we address the problem of late multimodal fusion at the decision level with stacking. Let $h_i$ be
the classifier that gives the score associated with the $i$-th modality for any instance $x$. A classical method
consists in looking for a weighted linear combination of the different scores,

\[ H(x) = \sum_{i=1}^{n} q_i h_i(x), \]

where $q_i$ represents the weight associated with $h_i$. It is usually required that $0 \le q_i \le 1$ and
$\sum_{i=1}^{n} q_i = 1$. This linear weighting scheme can be seen as a majority vote. This approach is widely used
because of its robustness, simplicity and scalability due to small computational costs Atrey et al. (2010). It is
also more appropriate when there exist dependencies between the views Wu et al. (2004). An important issue is then
to find an optimal way to combine the scores. One solution is to use machine learning methods to assess the weights
Atrey et al. (2010). From a machine learning standpoint, considering a set of classifiers with a high diversity is
generally a desirable property Dietterich (2000). One illustration is given by the algorithm AdaBoost Freund and
Schapire (1996), frequently used as a multimodal fusion method. AdaBoost weights the classifiers according to
different distributions of the training data, introducing some diversity, but requires at least weak classifiers to
perform well. Another recent approach based on portfolio theory Wang and Kankanhalli (2010) proposes a fusion
procedure trying to minimize some risks over the different modalities and a correlation measure. While it is
well-founded, it needs to define some appropriate functions and is not fully adapted to the classifier fusion
problem since it does not directly take into account the diversity between the outputs of the classifiers.
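
As a minimal illustration of the weighted linear combination described above (a sketch, not the implementation used in this work), the following Python snippet fuses per-modality scores with fixed weights. The function name `late_fusion`, the toy scores and the weights are assumptions introduced only for this example.

```python
def late_fusion(scores, weights):
    """Weighted linear late fusion: H(x) = sum_i q_i * h_i(x).

    scores  : list of per-example score lists [h_1(x), ..., h_n(x)]
    weights : list of n weights q_i (non-negative, summing to 1)
    """
    if any(q < 0 for q in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return [sum(q * h for q, h in zip(weights, row)) for row in scores]


# Hypothetical scores of 3 modality classifiers on 2 examples.
scores = [[0.8, -0.2, 0.4],
          [-0.6, 0.1, -0.3]]
weights = [0.5, 0.2, 0.3]

fused = late_fusion(scores, weights)            # one fused score per example
labels = [1 if s >= 0 else -1 for s in fused]   # decision by the sign of the vote
```

The open question addressed in the rest of the paper is how to learn the weights rather than fix them by hand.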

We propose to study a new machine learning method, namely MinCq, introduced in Laviolette et al. (2011). It
proposes a quadratic program for learning a weighted majority vote over real-valued functions called voters (such as
score functions of classifiers). The algorithm is based on the minimization of a generalization bound that takes into
account both the risk of committing an error and the diversity of the voters, offering strong theoretical guarantees
on the learned majority vote. In this article, our aim is to show the interest of this algorithm for classifier fusion. We
provide evidence that MinCq is able to find good linear weightings, but also well-performing non-linear combinations
with an extra kernel layer over the scores. Since in multimedia retrieval the performance measure is related to the
rank of positive examples, we propose to extend MinCq to improve the Mean Average Precision. We base this
extension on an additional order-preserving loss for verifying pairwise ranking constraints.

The paper is organized as follows. Section 2 deals with the theoretical framework of MinCq. We extend MinCq
as a late fusion method in Section 3. Before concluding in Section 5, we evaluate empirically the MinCq late fusion
in Section 4.



2 PAC-Bayesian MinCq

In this section we present the algorithm MinCq of Laviolette et al. (2011) for learning a $Q$-weighted
majority vote of real-valued functions (e.g. classifier scores). This method is based on the PAC-Bayes theory
McAllester (1999). We first recall the setting of MinCq.



We consider binary classification tasks over a feature space $X \subseteq \mathbb{R}^d$ of dimension $d$. The label
space is $Y = \{-1, 1\}$. The training sample is $S = \{(x_j, y_j)\}_{j=1}^{m}$ where each example $(x_j, y_j)$ is
drawn i.i.d. from a fixed (but unknown) probability distribution $\mathcal{D}$ defined over $X \times Y$. We consider
a space of real-valued voters $\mathcal{H}$, such that $\forall h_i \in \mathcal{H}$, $h_i : X \to \mathbb{R}$. Given
a voter $h_i$, the predicted label of $x \in X$ is given by $\mathrm{sign}[h_i(x)]$, where $\mathrm{sign}[a] = 1$ if
$a \ge 0$ and $-1$ otherwise. Then, the learner aims at choosing a distribution $Q$ over $\mathcal{H}$ (the weights
$q_i$) leading to the $Q$-weighted majority vote $B_Q$ with the lowest risk. $B_Q$ is defined by,

\[ B_Q(x) = \mathrm{sign}[H_Q(x)] \quad \text{with} \quad H_Q(x) = \sum_{i=1}^{|\mathcal{H}|} q_i h_i(x). \]

The associated true risk $R_{\mathcal{D}}(B_Q)$ is defined as the probability that the majority vote misclassifies
an example drawn according to $\mathcal{D}$,

\[ R_{\mathcal{D}}(B_Q) = \Pr_{(x,y) \sim \mathcal{D}} \left( B_Q(x) \ne y \right). \]

In the case of MinCq, $\mathcal{H}$ has to be a finite auto-complemented family of $2n$ real-valued voters
$\mathcal{H} = \{h_1, \dots, h_{2n}\}$ such that,

\[ \forall x \in X,\ \forall i \in \{1, \dots, n\}, \quad h_{i+n}(x) = -h_i(x). \tag{1} \]

Moreover, the algorithm considers quasi-uniform distributions $Q$ over $\mathcal{H}$, i.e. the sum of the weight of
a voter and its opposite is $\frac{1}{n}$,

\[ \forall i \in \{1, \dots, n\}, \quad Q(h_i) + Q(h_{i+n}) = q_i + q_{i+n} = \frac{1}{n}. \tag{2} \]
This constraint is not too restrictive since every distribution over $\mathcal{H}$ can be represented by a
quasi-uniform distribution Laviolette et al. (2011). The assumptions (1) and (2) are actually an elegant trick to
avoid the use of a prior distribution over $\mathcal{H}$, which is often required by usual PAC-Bayesian methods
McAllester (1999), making the algorithm more easily applicable.
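
To make the effect of assumptions (1) and (2) concrete, the following sketch checks numerically that a quasi-uniform vote over the $2n$ voters equals a vote over the first $n$ voters with weights $2q_i - 1/n$. The data are random toy values and all variable names are assumptions of this example.

```python
import random

random.seed(0)
n, m = 4, 6

# F[j][i] = h_i(x_j): hypothetical scores of the first n voters on m examples.
F = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]

# Quasi-uniform weights, Eq. (2): q_{i+n} = 1/n - q_i.
q_first = [random.uniform(0, 1 / n) for _ in range(n)]
q_second = [1 / n - q for q in q_first]

# Vote over the 2n voters of the auto-complemented family, Eq. (1): h_{i+n} = -h_i.
vote_2n = [sum(q * h for q, h in zip(q_first, row)) +
           sum(q * (-h) for q, h in zip(q_second, row)) for row in F]

# The same vote written with the first n voters only.
vote_n = [sum((2 * q - 1 / n) * h for q, h in zip(q_first, row)) for row in F]
```

Both lists agree up to rounding, which is why the optimization can be carried out over $n$ weights only.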

We now present the principle of the algorithm MinCq. The core of MinCq is the minimization of the empirical 
version of a bound — the C-Bound — over the risk of the Q-weighted majority vote. 



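Theorem 1 (C-Bound Laviolette et al. (2011)) Given $\mathcal{H} = \{h_1, \dots, h_{2n}\}$ a class of $2n$ functions,
for any weights $\{q_i\}_{i=1}^{2n}$, i.e. distribution $Q$, on $\mathcal{H}$ and any distribution $\mathcal{D}$ over
$X \times Y$, if $\mathbb{E}_{(x,y) \sim \mathcal{D}}\, y H_Q(x) > 0$ then $R_{\mathcal{D}}(B_Q) \le C_Q^{\mathcal{D}}$
where,

\[ C_Q^{\mathcal{D}} = \frac{\mathrm{Var}_{(x,y) \sim \mathcal{D}} \left( y H_Q(x) \right)}{\mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \left( y H_Q(x) \right)^2 \right]} = 1 - \frac{\left( \mathcal{M}_Q^{\mathcal{D}} \right)^2}{\mathcal{M}_{Q^2}^{\mathcal{D}}}, \]

with $\mathcal{M}_Q^{\mathcal{D}} = \mathbb{E}_{(x,y) \sim \mathcal{D}} \sum_{i=1}^{2n} y\, q_i h_i(x)$, and
$\mathcal{M}_{Q^2}^{\mathcal{D}} = \mathbb{E}_{(x,y) \sim \mathcal{D}} \sum_{i=1}^{2n} \sum_{i'=1}^{2n} q_i q_{i'} h_i(x) h_{i'}(x)$
are respectively the first and the second moments of the $Q$-margin: $y H_Q(x)$.

Following some generalization bounds, MinCq proposes to minimize the empirical version of the C-bound,
$C_Q^S = 1 - \frac{(\mathcal{M}_Q^S)^2}{\mathcal{M}_{Q^2}^S}$, over a sample $S$. The idea is to fix the empirical
first moment $\mathcal{M}_Q^S$ to a margin $\mu > 0$ and to minimize the empirical second moment
$\mathcal{M}_{Q^2}^S$ measuring the correlation of the voters. Once the first moment is fixed to $\mu$, the bound
reads $1 - \mu^2 / \mathcal{M}_{Q^2}^S$, so a smaller second moment gives a smaller bound. This leads to minimizing
the bound, and thus the risk of the majority vote, by taking into account the diversity between the voters.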
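
As a small worked example of the empirical C-bound, the sketch below computes $1 - (\text{first moment})^2 / (\text{second moment})$ from a list of margins $y_j H_Q(x_j)$ and compares it with the empirical error of the vote. The margins are made-up values and `empirical_c_bound` is a hypothetical helper name.

```python
def empirical_c_bound(margins):
    """Empirical C-bound: 1 - (first moment)^2 / (second moment) of the margins.

    margins : list of y_j * H_Q(x_j) computed on a sample S
    """
    m = len(margins)
    first = sum(margins) / m
    second = sum(g * g for g in margins) / m
    if first <= 0:
        raise ValueError("the C-bound requires a positive first moment")
    return 1.0 - first ** 2 / second


# Hypothetical margins on 4 examples; the third one is misclassified.
margins = [1.0, 1.0, -0.5, 1.0]
bound = empirical_c_bound(margins)

# Empirical error of the vote, counting non-positive margins as errors.
risk = sum(1 for g in margins if g <= 0) / len(margins)
```

Here the first moment is 0.625 and the second moment is 0.8125, so the bound is 27/52 (about 0.519), above the empirical error of 0.25.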



Definition 1 (MinCq algorithm Laviolette et al. (2011)) Given a set $\mathcal{H} = \{h_1, \dots, h_{2n}\}$ of voters,
a training set $S = \{(x_j, y_j)\}_{j=1}^{m}$, and a margin $\mu > 0$, among all quasi-uniform distributions $Q$ of
empirical margin $\mathcal{M}_Q^S$ exactly equal to $\mu$, the MinCq algorithm consists in finding one that minimizes
the empirical $\mathcal{M}_{Q^2}^S$.

Due to the auto-complemented (1) and quasi-uniformity (2) assumptions, the algorithm can be expressed as a
quadratic program (MinCq) by only considering the first $n$ voters $h_i \in \mathcal{H}$.

\[
\begin{aligned}
\operatorname{argmin}_{Q} \quad & Q^t M_S Q - A_S^t Q, \\
\text{s.t.} \quad & m_S^t Q = \frac{\mu}{2} + \frac{1}{2nm} \sum_{j=1}^{m} \sum_{i=1}^{n} y_j h_i(x_j), \\
\text{and} \quad & \forall i \in \{1, \dots, n\}, \quad 0 \le q_i \le \frac{1}{n},
\end{aligned}
\tag{MinCq}
\]

where $^t$ denotes the transpose, $Q = (q_1, \dots, q_n)^t$ is the vector of the first $n$ weights $q_i$, $M_S$ is
the $n \times n$ matrix formed by $\frac{1}{m} \sum_{j=1}^{m} h_i(x_j) h_{i'}(x_j)$ for $i$ and $i'$ in
$\{1, \dots, n\}$, and,

\[ m_S = \left( \frac{1}{m} \sum_{j=1}^{m} y_j h_1(x_j), \dots, \frac{1}{m} \sum_{j=1}^{m} y_j h_n(x_j) \right)^t, \]

\[ A_S = \left( \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} h_1(x_j) h_i(x_j), \dots, \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} h_n(x_j) h_i(x_j) \right)^t. \]

Finally, the $Q$-weighted majority vote learned by MinCq is then

\[ B_Q(x) = \mathrm{sign}[H_Q(x)] \quad \text{with} \quad H_Q(x) = \sum_{i=1}^{n} \left( 2 q_i - \frac{1}{n} \right) h_i(x). \]
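
The sketch below assembles the ingredients of (MinCq) on a toy sample and checks, at one feasible point, that the equality constraint fixes the first moment of the margin to $\mu$ and that the objective equals the second moment up to a factor 4 and an additive constant. It does not solve the program: any quadratic programming solver could be plugged in on top of these terms. The data, the helper names and the chosen point `Q` are assumptions of this example.

```python
def mincq_terms(F, y, mu):
    """Build M_S, A_S, m_S and the right-hand side of the margin constraint.

    F[j][i] = h_i(x_j) for the first n voters, y[j] in {-1, +1}, mu > 0.
    """
    m, n = len(F), len(F[0])
    M = [[sum(F[j][i] * F[j][k] for j in range(m)) / m for k in range(n)]
         for i in range(n)]
    A = [sum(M[i]) / n for i in range(n)]   # (1/(nm)) sum_k sum_j h_i(x_j) h_k(x_j)
    ms = [sum(y[j] * F[j][i] for j in range(m)) / m for i in range(n)]
    rhs = mu / 2 + sum(ms) / (2 * n)
    return M, A, ms, rhs


def objective(Q, M, A):
    """Value of Q^t M_S Q - A_S^t Q."""
    n = len(Q)
    quad = sum(Q[i] * M[i][k] * Q[k] for i in range(n) for k in range(n))
    return quad - sum(a * q for a, q in zip(A, Q))


def margin_moments(Q, F, y):
    """First and second empirical moments of the margin y_j * H_Q(x_j)."""
    n = len(Q)
    margins = [yj * sum((2 * q - 1 / n) * h for q, h in zip(Q, row))
               for row, yj in zip(F, y)]
    first = sum(margins) / len(margins)
    second = sum(g * g for g in margins) / len(margins)
    return first, second


# Toy data: 4 examples, 2 voters (hypothetical scores).
F = [[1.0, 0.5], [0.5, 1.0], [-1.0, 0.5], [-0.5, -1.0]]
y = [1, 1, -1, -1]
mu = 0.25
M, A, ms, rhs = mincq_terms(F, y, mu)

Q = [0.5, 0.125]                       # a feasible point: 0 <= q_i <= 1/n
first, second = margin_moments(Q, F, y)
const = sum(sum(row) for row in M) / len(Q) ** 2   # term independent of Q
```

Since the additive constant does not depend on $Q$, minimizing the objective under the equality constraint is the same as minimizing the second moment at fixed margin.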



3 MinCq as a Late Fusion Method 

PAC-Bayesian MinCq has been proposed in the particular context of binary classification where the objective is 
to minimize the misclassification rate of the Q-weighted majority vote by taking into account the diversity of the 
voters. From a multimedia indexing standpoint, MinCq thus appears to be a natural way for late classifier fusion
to combine the predictions of classifiers separately trained on different modalities.

Concretely, given a training sample of size $2m$ we split it randomly into two subsets $S'$ and
$S = \{(x_j, y_j)\}_{j=1}^{m}$ of the same size. Let $n$ be the number of modalities. For each modality $i$, we train
a classifier $h_i$ from $S'$. Let $\mathcal{H} = \{h_1, \dots, h_n, -h_1, \dots, -h_n\}$ be the set of the $n$
associated prediction functions and their opposites. At this step, the fusion is achieved by MinCq: we learn from
$S$ the $Q$-weighted majority vote over $\mathcal{H}$ with the lowest risk. However, in many applications, such as
multimedia document retrieval, people are interested in performance measures related to precision or recall. Since
a low-error vote is not necessarily a good ranker, we propose an adaptation of MinCq to improve the popular Mean
Average Precision (MAP).

We first recall the definition of the MAP measured on $S$ for a given real-valued function $h$. Let
$S^+ = \{(x_j, y_j) : (x_j, y_j) \in S \wedge y_j = 1\} = \{(x_{j^+}, 1)\}_{j^+=1}^{m^+}$ be the set of the $m^+$
positive examples from $S$ and
$S^- = \{(x_j, y_j) : (x_j, y_j) \in S \wedge y_j = -1\} = \{(x_{j^-}, -1)\}_{j^-=1}^{m^-}$ the set of the $m^-$
negative examples from $S$ ($m^+ + m^- = m$). For evaluating the MAP, one ranks the examples in descending order of
the scores. The MAP of $h$ is,

\[ MAP_S(h) = \frac{1}{m^+} \sum_{j : y_j = 1} Prec@j, \]

where $Prec@j$ is the percentage of positive examples in the top $j$. The intuition behind this definition is that
we prefer positive examples with a score higher than negative ones. To achieve this goal, we propose to learn
with pairwise preference Fürnkranz and Hüllermeier (eds) (2010) on pairs of positive-negative instances. Indeed,
pairwise methods are known to be a good compromise between accuracy and more complex performance measures
like MAP. In particular, the notion of order-preserving pairwise loss was introduced in Zhang (2004) in the context
of multiclass classification. Following this idea, Yue et al. (2007) have proposed an SVM-based method with
a hinge-loss relaxation of a MAP-loss. In our specific case of MinCq for multimedia fusion, we design an
order-preserving pairwise loss for correctly ranking the positive examples. Actually, for each pair
$(x_{j^+}, x_{j^-}) \in S^+ \times S^-$, we want: $H_Q(x_{j^+}) > H_Q(x_{j^-}) \Leftrightarrow H_Q(x_{j^-}) - H_Q(x_{j^+}) < 0$.
This can be forced by minimizing (according to the weights $q_i$) the following hinge-loss relaxation of the
previous equation,

\[ \frac{1}{m^+ m^-} \sum_{j^+=1}^{m^+} \sum_{j^-=1}^{m^-} \left[ H_Q(x_{j^-}) - H_Q(x_{j^+}) \right]_+, \tag{3} \]

where $[a]_+ = \max(a, 0)$ is the hinge-loss. In the setting of MinCq, with $\mathcal{H}$ auto-complemented
(Eq. (1)) and $Q$ quasi-uniform (Eq. (2)), we reduce the term (3) to,

\[ \frac{1}{m^+ m^-} \sum_{j^+=1}^{m^+} \sum_{j^-=1}^{m^-} \left[ \sum_{i=1}^{n} \left( 2 q_i - \frac{1}{n} \right) \left( h_i(x_{j^-}) - h_i(x_{j^+}) \right) \right]_+. \tag{4} \]
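
To illustrate the link between the MAP and the pairwise hinge relaxation, the sketch below computes both from a list of fused scores. It is a toy example under stated assumptions: the function names and the scores are made up, the precision at each positive rank is averaged over the positive examples, and the hinge term is averaged over the $m^+ m^-$ pairs.

```python
def average_precision(scores, labels):
    """Average of Prec@j over the ranks j occupied by positive examples."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    hits, total = 0, 0.0
    for rank, j in enumerate(order, start=1):
        if labels[j] == 1:
            hits += 1
            total += hits / rank      # Prec@rank, counted only at positive ranks
    return total / hits if hits else 0.0


def pairwise_hinge(scores, labels):
    """Mean over positive/negative pairs of [H(x-) - H(x+)]_+."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    loss = sum(max(sn - sp, 0.0) for sp in pos for sn in neg)
    return loss / (len(pos) * len(neg))


labels = [1, -1, 1, -1]
mixed = [0.9, 0.8, 0.3, 0.1]      # one negative is ranked above a positive
perfect = [0.9, 0.1, 0.8, 0.3]    # every positive is ranked above every negative

ap_mixed, loss_mixed = average_precision(mixed, labels), pairwise_hinge(mixed, labels)
ap_perfect, loss_perfect = average_precision(perfect, labels), pairwise_hinge(perfect, labels)
```

On the perfectly ordered scores the hinge term vanishes and the average precision is 1, whereas the single inverted pair in `mixed` lowers the average precision to 5/6 and yields a positive hinge term.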



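To deal with the hinge-loss of (4), we consider $m^+ \times m^-$ additional slack variables
$\xi_{S^+ \times S^-} = (\xi_{j^+ j^-})_{1 \le j^+ \le m^+,\, 1 \le j^- \le m^-}$ weighted by a parameter $\beta > 0$.
We make a little abuse of notation to highlight the difference with (MinCq): since $\xi_{S^+ \times S^-}$ appears
only in the linear term, we simply add the corresponding term after the (MinCq) objective. We obtain the quadratic
program (MinCq$_{PW}$),

\[
\begin{aligned}
\operatorname{argmin}_{Q,\, \xi_{S^+ \times S^-}} \quad & Q^t M_S Q - A_S^t Q + \beta\, \mathrm{Id}^t \xi_{S^+ \times S^-}, \\
\text{s.t.} \quad & m_S^t Q = \frac{\mu}{2} + \frac{1}{2nm} \sum_{j=1}^{m} \sum_{i=1}^{n} y_j h_i(x_j), \\
& \forall j^+ \in \{1, \dots, m^+\},\ \forall j^- \in \{1, \dots, m^-\}, \quad \xi_{j^+ j^-} \ge 0, \\
& \xi_{j^+ j^-} \ge \frac{1}{m^+ m^-} \sum_{i=1}^{n} \left( 2 q_i - \frac{1}{n} \right) \left( h_i(x_{j^-}) - h_i(x_{j^+}) \right), \\
\text{and} \quad & \forall i \in \{1, \dots, n\}, \quad 0 \le q_i \le \frac{1}{n},
\end{aligned}
\tag{MinCq$_{PW}$}
\]

where Id is the unit vector of size $m^+ \times m^-$. However, one drawback of this method is the incorporation of a
quadratic number of additional variables ($m^+ \times m^-$), which makes the problem harder to solve. To overcome
this problem, we propose to relax the constraints by considering the average score of the negative examples: we force
the positive examples to be higher than the average negative scores. This leads us to the following alternative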
problem (MinCq$_{PWav}$) with only $m^+$ additional variables.

\[
\begin{aligned}
\operatorname{argmin}_{Q,\, \xi_{S^+}} \quad & Q^t M_S Q - A_S^t Q + \beta\, \mathrm{Id}^t \xi_{S^+}, \\
\text{s.t.} \quad & m_S^t Q = \frac{\mu}{2} + \frac{1}{2nm} \sum_{j=1}^{m} \sum_{i=1}^{n} y_j h_i(x_j), \\
& \forall j^+ \in \{1, \dots, m^+\}, \quad \xi_{j^+} \ge 0, \\
& \xi_{j^+} \ge \frac{1}{m^+ m^-} \sum_{j^-=1}^{m^-} \sum_{i=1}^{n} \left( 2 q_i - \frac{1}{n} \right) \left( h_i(x_{j^-}) - h_i(x_{j^+}) \right), \\
\text{and} \quad & \forall i \in \{1, \dots, n\}, \quad 0 \le q_i \le \frac{1}{n},
\end{aligned}
\tag{MinCq$_{PWav}$}
\]

where Id is the unit vector of size $m^+$.

Table 1: MAP obtained on the PascalVOC'07 test sample. On the left, experiments with rbf kernel layer. On the
right, without.

             with rbf kernel layer               | without kernel layer
concept      MinCq^rbf_PWav  MinCq^rbf  SVM^rbf  | MinCq_PWav  MinCq_PW  MinCq   Σ       Σ_MAP   best    h_best
aeroplane    0.513           0.513      0.497    | 0.487       0.486     0.526   0.460   0.241   0.287   0.382
bicycle      0.273           0.219      0.232    | 0.195       0.204     0.221   0.077   0.086   0.051   0.121
bird         0.2659          0.264      0.196    | 0.169       0.137     0.204   0.110   0.093   0.113   0.123
boat         0.267           0.242      0.240    | 0.1593      0.154     0.159   0.206   0.132   0.079   0.258
bottle       0.103           0.099      0.042    | 0.112       0.126     0.118   0.023   0.025   0.017   0.066
bus          0.261           0.261      0.212    | 0.167       0.166     0.168   0.161   0.098   0.089   0.116
car          0.530           0.530      0.399    | 0.521       0.465     0.495   0.227   0.161   0.208   0.214
cat          0.253           0.245      0.160    | 0.230       0.219     0.220   0.074   0.075   0.065   0.116
chair        0.397           0.397      0.312    | 0.257       0.193     0.230   0.242   0.129   0.178   0.227
cow          0.158           0.177      0.117    | 0.102       0.101     0.118   0.078   0.068   0.06    0.101
diningtable  0.263           0.227      0.245    | 0.118       0.131     0.149   0.153   0.091   0.093   0.124
dog          0.261           0.179      0.152    | 0.260       0.259     0.253   0.004   0.064   0.028   0.126
horse        0.495           0.4504     0.437    | 0.3011      0.259     0.303   0.364   0.195   0.141   0.221
motorbike    0.295           0.284      0.207    | 0.1412      0.113     0.162   0.193   0.115   0.076   0.130
person       0.630           0.614      0.237    | 0.624       0.617     0.604   0.001   0.053   0.037   0.246
pottedplant  0.102           0.116      0.065    | 0.067       0.061     0.061   0.057   0.04    0.046   0.073
sheep        0.184           0.175      0.144    | 0.0666      0.096     0.0695  0.128   0.062   0.064   0.083
sofa         0.246           0.211      0.162    | 0.204       0.208     0.201   0.137   0.087   0.108   0.147
train        0.399           0.385      0.397    | 0.331       0.332     0.335   0.314   0.164   0.197   0.248
tvmonitor    0.272           0.257      0.230    | 0.281       0.281     0.256   0.015   0.102   0.069   0.171
Average      0.301           0.292      0.234    | 0.240       0.231     0.243   0.151   0.104   0.100   0.165

Note that the two approaches still respect the framework of the original MinCq. We simply regularize the search
of the weights for a $Q$-weighted majority vote leading to a higher MAP.
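
The following sketch compares, for fixed fused scores, the total slack required by the full pairwise constraints with the total slack required by the averaged constraints. It assumes that both programs scale the score differences by $1/(m^+ m^-)$, which is this example's reading of the constraints; the comparison itself holds for any positive scaling, because the positive part of a sum never exceeds the sum of the positive parts. Function names and scores are made up.

```python
def pairwise_slack_total(pos, neg):
    """Sum of the smallest feasible slacks with one variable per positive-negative pair."""
    c = 1.0 / (len(pos) * len(neg))
    return sum(max(c * (sn - sp), 0.0) for sp in pos for sn in neg)


def averaged_slack_total(pos, neg):
    """Sum of the smallest feasible slacks with one variable per positive example."""
    c = 1.0 / (len(pos) * len(neg))
    return sum(max(c * sum(sn - sp for sn in neg), 0.0) for sp in pos)


# Hypothetical fused scores H_Q(x) of the positive and negative examples.
pos = [0.9, 0.3]
neg = [0.8, 0.1]

pw = pairwise_slack_total(pos, neg)    # m+ * m- = 4 slack variables
av = averaged_slack_total(pos, neg)    # m+ = 2 slack variables
```

On this toy example the averaged version charges 0.075 instead of 0.125: it is a looser penalty, obtained with far fewer variables.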

Finally, for tuning the hyperparameters $(\mu, \beta)$ we use a cross-validation process (CV). Instead of selecting the
parameters leading to the lowest risk, we select the ones leading to the best MAP.



4 Experiments 

In this section, we show empirically the interest of MinCq, and of our extension, as a late fusion method with
stacking (implemented with the MOSEK solver). We evaluate the MinCq-based approaches on the PascalVOC'07
benchmark Everingham et al. (2007), where the goal is to identify 20 visual concepts in images. The corpus is
made up of 5000 training and 5000 test images. In general, the ratio between positive and negative examples is
less than 10%. For each concept, we generate a training sample made up of all the training positive examples and of
negative examples independently drawn such that the positive ratio is 1/3. We keep the original test set.

Our objective is not to provide the best results on this benchmark but rather to evaluate whether the MinCq-based
methods could be helpful for the late fusion step in multimedia indexing. To do so, we split the training sample into
two subsets, $S'$ and $S$, of the same size. We consider 9 different visual features: 1 SIFT, 1 LBP, 1 Percepts, 2 HOG,
2 Local Color Histograms and 2 Color Moments. Then, we train from $S'$ an SVM classifier for each visual feature
(with the LibSVM library Chang and Lin (2001) and an rbf kernel with parameters tuned by CV). The final classifier
fusion is learned from $S$.

In a first series of experiments, the set of voters $\mathcal{H}$ is made up of the 9 SVM classifiers (MinCq also
considers their opposites). We compare the 3 linear MinCq methods (MinCq), (MinCq$_{PW}$) and (MinCq$_{PWav}$) to the
following 4 baseline fusion approaches.
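• The best classifier of $\mathcal{H}$:

\[ h_{best} = \operatorname{argmax}_{h_i \in \mathcal{H}} MAP_S(h_i). \]

• The one with the highest margin:

\[ best(x) = \operatorname{argmax}_{h_i \in \mathcal{H}} |h_i(x)|. \]

• The sum of the classifiers (unweighted vote):

\[ \Sigma(x) = \sum_{h_i \in \mathcal{H}} h_i(x). \]

• The MAP-weighted vote:

\[ \Sigma_{MAP}(x) = \sum_{h_i \in \mathcal{H}} \frac{MAP_S(h_i)}{\sum_{h_{i'} \in \mathcal{H}} MAP_S(h_{i'})}\, h_i(x). \]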
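
The four baselines listed above can be sketched as follows for a single example. This is an illustration under assumptions: the function names are hypothetical, the scores and per-voter MAP values are made up, and the highest-margin rule is read as returning the score of the most confident voter.

```python
def best_voter(maps):
    """Index of the single voter with the highest MAP on S."""
    return max(range(len(maps)), key=lambda i: maps[i])


def highest_margin(row):
    """Score of the voter that is most confident on this example (largest |h_i(x)|)."""
    return max(row, key=abs)


def unweighted_sum(row):
    """Sum of the voters' scores (unweighted vote)."""
    return sum(row)


def map_weighted_sum(row, maps):
    """Sum of the voters' scores, each weighted by its normalized MAP."""
    total = sum(maps)
    return sum(mp / total * h for mp, h in zip(maps, row))


row = [0.2, -0.9, 0.5]     # hypothetical scores h_1(x), h_2(x), h_3(x)
maps = [0.5, 0.2, 0.3]     # hypothetical MAP of each voter measured on S

score_best = row[best_voter(maps)]
score_margin = highest_margin(row)
score_sum = unweighted_sum(row)
score_map = map_weighted_sum(row, maps)
```

None of these rules takes the correlation between voters into account, which is the point on which the MinCq-based fusions differ.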



In a second series, we propose to introduce non-linear information with an rbf kernel layer. We represent each
example by the vector of its scores given by the 9 SVM classifiers, $\mathcal{H}$ being the set of kernels over the
sample $S$: each $x \in S$ is seen as a voter $k(\cdot, x)$. We then compare our method to stacking with an SVM tuned
by CV (SVM$^{rbf}$). Note that we do not report the results of (MinCq$_{PW}$) in this context, because the
computational cost is much higher and the performance is lower. The full pairwise version implies too many
variables, which may penalize the resolution of (MinCq$_{PW}$).
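
As a sketch of the kernel layer, the snippet below turns each example of $S$, represented by its vector of classifier scores, into a voter $k(\cdot, x)$ and evaluates all voters on all examples. The Gaussian form of the rbf kernel is standard; the helper names, the value of `gamma` and the score vectors are assumptions of this example.

```python
import math


def rbf(u, v, gamma):
    """Gaussian (rbf) kernel k(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def kernel_voter_matrix(score_vectors, gamma=1.0):
    """F[j][i] = k(s_j, s_i): output on example j of the voter attached to example i."""
    return [[rbf(sj, si, gamma) for si in score_vectors] for sj in score_vectors]


# Hypothetical score vectors: 3 examples described by the scores of 2 classifiers.
score_vectors = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
F = kernel_voter_matrix(score_vectors, gamma=1.0)
```

The resulting matrix plays the role of the voters' outputs $h_i(x_j)$, with one voter per example of $S$ instead of one per modality.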

In either case, the hyperparameters of the MinCq-based methods are tuned with a grid search by 5-fold CV. The
MAP performances are reported in Table 1, and we can make the following remarks.

• On the right, for the first experiments, we clearly see that the linear MinCq-based algorithms outperform on
average the linear baselines. At least one MinCq-based method produces the highest MAP, except for "boat"
for which $h_{best}$ is the best. We note that the order-preserving hinge-loss is not really helpful: the classical
(MinCq) shows the best MAP. In fact, this can be explained by the limited number of voters.

• On the left, with a kernel layer, at least one MinCq-based method achieves the highest MAP and for 17/20
concepts both are better than SVM. Moreover, MinCq$^{rbf}_{PWav}$ with the averaged pairwise preference is the best
for 17 concepts, showing the order-preserving loss is a good compromise between improving the MAP and keeping
a reasonable computational cost.

• Globally, kernel-based MinCq methods outperform the other methods. Moreover, at least one MinCq-based
approach is the best for each concept, showing PAC-Bayesian MinCq is a good alternative for late classifier
fusion.

5 Conclusion 

We propose in this paper to make use of a well-founded learning quadratic program called MinCq as a novel
multimedia late fusion method. PAC-Bayesian MinCq was originally developed for binary classification and aims
at minimizing the error rate of the weighted majority vote by considering the diversity of the voters Laviolette et al.
(2011). In the context of multimedia indexing, we claim that MinCq thus appears naturally appropriate for late
classifier fusion in order to combine the predictions of classifiers trained from different modalities. Our experiments
show that MinCq is a very competitive alternative for classifier fusion. Moreover, the incorporation of average
order-preserving constraints is sometimes able to improve the MAP performance measure. Beyond these results,
such PAC-Bayesian methods open the door to defining other theoretically well-founded frameworks to design new
algorithms for many multimedia tasks such as multi-modality indexing, multi-label classification, ranking, etc.